RELIABILITY FOR THE LAW OF
COMPARATIVE JUDGMENT¹

In studies using the method of paired comparisons and the law of
comparative judgment, it is desirable to determine the reliability of
the scales which are obtained. For a given set of data one might like
to know the extent to which the law of comparative judgment is successful
in accounting for the total variance in the data.

Mosteller (13) has outlined a chi-square test of the agreement
between the fitted proportions ( p* ) and the observed proportions
( p ); such a test labels the discrepancy between observation and theory
as either "significant" or "non-significant" but does not indicate whether
the variance accounted for by the theory is large or small in relation
to the total variance in the data.

This property of significance tests is well known and has been
clearly stated by Cochran (3) in his discussion of the chi-square test:

"The power of the test to detect an underlying disagreement
between theory and data is controlled largely by the size of the
sample. With a small sample an alternative hypothesis which departs
violently from the null hypothesis may still have a small
probability of yielding a significant value of $\chi^2$. In a very
large sample, small and unimportant departures from the null
hypothesis are almost certain to be detected."

¹Thanks are due to Ledyard Tucker and Frederic Lord for valuable
suggestions on the development presented here.

If the sample is small then the $\chi^2$ test will show that the data
are "not significantly different from" quite a wide range of very
different theories, while if the sample is large, the $\chi^2$ test will
show that the data are "significantly" different from those expected
on a given theory even though the difference may be so very slight as
to be negligible or unimportant on other criteria. Fisher (6) gives
a good illustration of this point in his analysis of Weldon's data on
dice throws. If we test the theory that a throw of 5 or 6 has a
probability of 1/3, then chi-square for Weldon's data is very large,
with p of .0001. However, a very slight change in the theory, from
a probability of .3333 to a probability of .3377, gives a quite
reasonable chi-square with a p value of .3 or .4.
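The dependence of chi-square on sample size can be made concrete with a small sketch. The counts below are hypothetical (they are not Weldon's data); the observed proportion is simply held near .3377 while the number of trials grows, and the theory tested is a probability of 1/3. The function name and the trial sizes are assumptions made for illustration.

```python
def chi_square_two_cell(successes, trials, p0):
    """Pearson chi-square (1 df) for a success count against a hypothesized p0."""
    expected_yes = trials * p0
    expected_no = trials * (1.0 - p0)
    failures = trials - successes
    return ((successes - expected_yes) ** 2 / expected_yes
            + (failures - expected_no) ** 2 / expected_no)

# Hypothetical illustration: same observed proportion, increasing sample size.
observed_proportion = 0.3377
chi_by_n = {}
for trials in (1_000, 10_000, 100_000, 1_000_000):
    successes = round(trials * observed_proportion)
    chi_by_n[trials] = chi_square_two_cell(successes, trials, 1.0 / 3.0)
```

With the discrepancy between theory and data held fixed, the statistic grows roughly in proportion to the number of trials, so the same small departure passes unnoticed in the smaller samples and exceeds the 5 percent point for one degree of freedom (3.84) in the larger ones.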

In  order  to  proceed  appropriately  in  any  scientific  investigation 
it  is  likely  to  be  necessary  to  answer  two  different  questions: 

1.  Is  it  reasonable  to  say  that  random  variation  accounts  for 
the  difference  between  theory  and  data? 

2.  How  large  is  this  difference  relative  to  the  variation  that 
is  accounted  for  by  the  theory? 

In studying the applicability of the law of comparative judgment,
variance-component and analysis-of-variance techniques can provide
appropriate answers to these questions by methods outlined below and
there applied to two sets of data on handwriting specimens and to
Mosteller's (13) baseball data.

The  data  of  the  example. 

The handwriting specimens were chosen from the Ayres (1) handwriting
scale. This scale consists of a series of handwriting specimens
of nine different scale levels, numbered from 10 (the lowest) to 90
(the highest). Each of these scale values is represented by three
specimens, a "vertical" style (a), a normal slant (b), and an extreme
slant (c). Thus the scale consists of 27 different handwriting
specimens. In conventional use, a handwriting specimen to be scaled is
judged to be like one of the scale specimens or to fall between two
of them. Thus, specimens can be scaled 10 to 90. The extremely bad
or good ones might be either below 10 or above 90 respectively. Nine
of these handwriting specimens were chosen for the present experiment:
50a, 50b, 50c, 70a, 70b, 70c, 80a, 80b, and 80c (shown in Figure 1).

The 36 possible pairs for these nine specimens were arranged in a
booklet, with instructions for the judge to pick the better member of
each pair. It is interesting to note that one can easily develop a
discussion in a class in measurement to indicate that there are numerous
criteria on which it is possible to judge these handwriting specimens;
the class will rather readily reach the conclusion that any set of
judgments would be meaningless, highly unreliable, and unduplicatable
unless one defined in great verbal detail exactly what characteristic
was to be judged, instead of simply using the term "better handwriting."
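As a quick check on the size of the booklet, the number of unordered pairs of k stimuli is k(k - 1)/2. A minimal sketch (the variable names are illustrative only):

```python
from itertools import combinations

# The nine specimens chosen for the experiment.
specimens = ["50a", "50b", "50c", "70a", "70b", "70c", "80a", "80b", "80c"]

# Every unordered pair appears once in the booklet.
pairs = list(combinations(specimens, 2))
k = len(specimens)
n_pairs = k * (k - 1) // 2
```

The same count, k(k - 1)/2, reappears later as the total degrees of freedom of the analysis.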

In the late 1930's this schedule was given without preliminary
discussion of the problem to 100 students at the University of Chicago,
and in the late 1940's it was given, again without preliminary
discussion, to 100 students at Princeton University. The data ( p , the
observed proportions) are shown in Table 1. The agreement between
these two sets of judgments for 100 people taken in different institutions
about ten years apart is rather striking.


FIGURE 1

Selected Specimens from the Ayres Handwriting Scale.


The two sets of scale values obtained from utilizing the law of
comparative judgment as stated by Thurstone (14, 15) are shown in Table
2. In both of these scales, stimulus 50a (the poorest one) has been
chosen as having a scale value of zero. The fitted proportions ( p* )
computed from these scale values are given in Table 3. The scale values
for the total group, given in Table 2, are found by summing the frequencies
for the two groups and then proceeding to scale as for the single groups.
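The step from scale values to fitted proportions can be sketched as follows, assuming (as the note to Table 2 states) that the difference of two scale values is read as a unit normal deviate. The function names are illustrative; the two scale values used are the Chicago values for 50b and 50c.

```python
import math

def normal_cdf(z):
    """Cumulative distribution function of the unit normal deviate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def fitted_proportion(scale_i, scale_j):
    """Fitted proportion choosing the stimulus scaled at scale_j over the one at scale_i."""
    return normal_cdf(scale_j - scale_i)

# Chicago scale values: 50a = 0.000, 50b = 0.210, 50c = 0.657.
p_50a_50b = fitted_proportion(0.000, 0.210)
p_50a_50c = fitted_proportion(0.000, 0.657)
```

These two values agree to three decimals with the Chicago entries .583 and .744 in the fitted proportions.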

When Mosteller's (13) chi-square test for goodness of fit is applied
to these data one finds (see Table 5) a chi-square of about 74
for the Chicago data, 76 for the Princeton data, and 127 for the two
groups combined. The corresponding p-values are each less than .0001,
the chi-square value at the .01 level being only 48. Thus, the
conclusion reached would be that the data are not fully accounted for by
the law of comparative judgment. However, it is interesting and
meaningful to know whether the fraction of the systematic variation
which is not accounted for should be regarded as approximately 1 or
2 percent or as much as 75 percent. For example, if an aptitude test
has a validity coefficient of .5 for predicting some criterion, it is
considered a very useful test, even though it is also true that 75
percent of the variance in the criterion is not accounted for by the test.
Under such circumstances it would doubtless be true that the criterion
contains a significant non-random component that is different from
anything represented by the test. Analysis-of-variance and
variance-component analysis procedures will give information on the percentage
of the variance which is accounted for and on the percentage which
TABLE 1

Experimental Proportions (p)

Handwriting
Specimens       50a   50b   50c   70a   70b   70c   80a   80b   80c

50a     C       --    .52   .67   .95   .99   .98   .99   .97   .94
        P       --    .52   .66   .88   .98   .98   .97   .83   .86

50b     C       .48   --    .60   .85   .95   .96   .98   .98   .95
        P       .48   --    .60   .69   .97   .96   .93   .94   .91

50c     C       .33   .40   --    .76   .78   .92   .91   .86   .96
        P       .34   .40   --    .70   .82   .94   .92   .84   .93

70a     C       .05   .15   .24   --    .76   .87   .95   .79   .78
        P       .12   .31   .30   --    .78   .84   .91   .70   .83

70b     C       .01   .05   .22   .24   --    .74   .80   .52   .71
        P       .02   .03   .18   .22   --    .64   .78   .37   .61

70c     C       .02   .04   .08   .13   .26   --    .59   .26   .56
        P       .02   .04   .06   .16   .36   --    .71   .30   .58

80a     C       .01   .02   .09   .05   .20   .41   --    .15   .31
        P       .03   .07   .08   .09   .22   .29   --    .15   .38

80b     C       .03   .02   .14   .21   .48   .74   .85   --    .61
        P       .17   .06   .16   .30   .63   .70   .85   --    .70

80c     C       .06   .05   .04   .22   .29   .44   .69   .39   --
        P       .14   .09   .07   .17   .39   .42   .62   .30   --
TABLE 2

Scale Values for Handwriting Specimens

               50a    50b    50c    70a    80b    70b    80c    70c    80a

Chicago       0.000  0.210  0.657  1.179  1.799  1.738  2.054  2.169  2.472
Princeton     0.000  0.107  0.384  0.808  1.252  1.578  1.690  1.794  2.048
Total Group   0.000  0.147  0.492  0.958  1.473  1.624  1.833  1.949  2.213

[Probability of choice approximately given by difference of scale
values interpreted as a unit (standard) normal deviate, fitted
according to Thurstone (14, 15) and Mosteller (13).]


TABLE 3

Theoretical Proportions (p*) Computed from
Scale Values in Table 2

                50a    50b    50c    70a    70b    70c    80a    80b    80c

50a     C       --     .583   .744   .881   .959   .985   .993   .964   .980
        P       --     .542   .650   .790   .943   .964   .980   .895   .955

50b     C       .417   --     .673   .834   .937   .973   .988   .944   .967
        P       .458   --     .609   .758   .929   .954   .974   .874   .943

50c     C       .256   .327   --     .699   .860   .935   .965   .873   .919
        P       .350   .391   --     .664   .884   .921   .952   .807   .904

70a     C       .119   .166   .301   --     .712   .839   .902   .732   .809
        P       .210   .242   .336   --     .780   .838   .893   .672   .811

70b     C       .041   .063   .140   .288   --     .667   .769   .524   .624
        P       .057   .071   .116   .220   --     .585   .681   .372   .545

70c     C       .015   .025   .065   .161   .333   --     .619   .356   .454
        P       .036   .046   .079   .162   .415   --     .600   .294   .459

80a     C       .007   .012   .035   .098   .231   .381   --     .251   .338
        P       .020   .026   .048   .107   .319   .400   --     .213   .360

80b     C       .036   .056   .127   .268   .476   .644   .749   --     .601
        P       .105   .126   .193   .328   .628   .706   .787   --     .669

80c     C       .020   .033   .081   .191   .376   .546   .662   .399   --
        P       .045   .057   .096   .189   .455   .541   .640   .331   --

C = Chicago data
P = Princeton data
remains to be accounted for after the law of comparative judgment has
been utilized, and will thus give coefficients which are analogous
to "reliabilities." For various illustrations of analysis of components
of variance see, for example, Mood (12), Bennett and Franklin (2),
Chapter 7, Davies' (4) discussion of "expectation of mean square" beginning
in Chapter 4, Duncan (5), especially Chapters 23 and 24, or Tippett's
(16) discussion of substantive variances in Chapters 6 and 7.

Framework  of  the  analysis. 

Since we are dealing with proportions, the sampling variance is
a function of the true proportion as well as of the sample size
[$N\sigma_p^2 = \pi(1 - \pi)$]. If the analysis is conducted in terms of an angular
transform of each proportion, then the (binomial) sampling variance is
a function primarily of N, and not of the true proportion. The
angular transform of the data is defined on different scales by different
authors. The simplest scale for our purposes is that used by Hald (9)
in his table, where

$\theta = 2 \arcsin \sqrt{p}$ (the arc is expressed in radians).

The variance of $\theta$ is approximately $1/N$ for proportions not
too near 1 or 0. If $Np$ and $N(1 - p)$ both exceed 4 or 5 the
approximation is quite good. Even more extreme cases may be analyzed by the
use of the averaged angular transformation, Freeman and Tukey (8), which
will be satisfactory for $Np$, $N(1 - p) > 1$. In the other common
version, tabled by Fisher and Yates (7),

$a = \arcsin \sqrt{p}$ (the arc is expressed in degrees).

The variance of $a$ is approximately $821/N$ for proportions
not too close to 1 or 0. Thus if $p = .50$, $\theta = \pi/2 = 1.5708$, while
$a = 45.00$. In general,

$a = \frac{45.00}{1.5708}\,\theta$

If tables of $a$ are used, then, in order to fit into the pattern of
Table 4, the resulting sums of squares should be divided by 821.
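The two versions of the angular transform, and the factor 821 that connects them, can be sketched as follows. The function and constant names are illustrative only.

```python
import math

def theta_radians(p):
    """Angular transform in radians: theta = 2 arcsin sqrt(p)."""
    return 2.0 * math.asin(math.sqrt(p))

def a_degrees(p):
    """Angular transform in degrees: a = arcsin sqrt(p)."""
    return math.degrees(math.asin(math.sqrt(p)))

# Since a = (90 / pi) * theta, variances and sums of squares in a are
# (90 / pi)^2 times those in theta; this factor is close to 821.
DEGREE_VARIANCE_FACTOR = (90.0 / math.pi) ** 2

theta_half = theta_radians(0.5)
a_half = a_degrees(0.5)
```

Because theta(p) + theta(1 - p) equals pi for every p, a complete set of angles that contains each pair in both orders necessarily has mean pi/2, about 1.5708.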

The convenience of an analysis in terms of $\theta$-values lies in
the fact that for pure binomial variation the variance of any $\theta$ is
substantially equal to the reciprocal of the number of observations on
which the p is based. This property of the angular transformation
allows the definition of modified chi-squares, such as the one used by
Mosteller, which do not require denominators. When necessary, we shall
distinguish these modified chi-squares as angular chi-squares.

For each ordered pair of stimuli ( i, j ) we have an observed
angle $\theta_{ij}$ corresponding to the observed p's of Table 1, and a fitted
angle $\theta^*_{ij}$ derived from the fitted scale and corresponding to the
fitted p*'s of Table 3. Because of the symmetry of the situation
the mean of the complete set of p's, or that of the p*'s, is .50.
Correspondingly, the mean of any complete set of $\theta$'s and the mean
of any complete set of $\theta^*$'s equals 1.5708.

Using angles, the analysis of variance is given in terms of the
following definitions:

$\theta = 2 \arcsin \sqrt{p}$ (observed values)

$\theta^* = 2 \arcsin \sqrt{p^*}$ (fitted values)

$\bar{\theta} = 1.5708 = 2 \arcsin \sqrt{.5}$
(the arc is measured in radians).

If all the stimuli are identical, and are judged to be identical, then
the proportion of judgments "i greater than j" would be .5 in every
case.

We treat the observed angles $\theta$ as if they were a sum of three
types of contribution. This treatment is approximate in two ways.
First, as Mosteller (13, p. 213) was careful to point out in connection
with his chi-square, the fitting used is a least-square fit on the
normal scale but not on the angular scale. Consequently, residuals
on the angular scale will not be as small as those resulting from a
fitting procedure tailored to the angular scale. As a consequence,
our estimated "reliability" coefficients will be somewhat smaller,
just as Mosteller's chi-squares are somewhat larger, than those
obtainable from more closely tailored fits. Second, the imperfect
linearity of the relation of angles to normal deviates means that the
true scale difference for any pair compared is, when measured in angles,
only approximately a difference. For the purpose of defining variance
components and reliabilities this latter effect should not be
quantitatively important. We shall use these approximations freely, usually
without further ado. (We hope to return to their consideration, as
well as that of other refinements of procedure, in another paper.)

Let us return to the three types of contribution associated with a
single comparison (as of two specimens of handwriting) and contributing
to the observed angle.

One contribution is approximately the difference between the true
scale values for the two stimuli (say $s_i - s_j$). These s values
may be thought of as drawn from a population with variance $\sigma_s^2$.
Hence the values in the cells ( $s_i - s_j$ ) are regarded as drawn from
a population with variance $2\sigma_s^2$.

Another is a deviations component (designated d) due to the
deviations of the data from the linear scaling model used. These
d-values are drawn from a population with variance $\sigma_d^2$.

Due to the fact that we are dealing with values determined from
proportions, we have a binomial error component (say b). These
values are drawn from a population with variance $\sigma_b^2$.

Thus we have the approximate composition of the observed values
and the associated variance of the population from which each of these
three quantities may be thought of as drawn, as follows:

$\theta_{ij} = (s_i - s_j) + d_{ij} + b_{ij}$

The population variances of these three components are respectively
$2\sigma_s^2$, $\sigma_d^2$, and $\sigma_b^2$. When the data are analyzed, the deviation of the
observed $\theta$ from their mean (designated $\bar{\theta}$) is easily separated into
two parts, one a linear component in agreement with the law of
comparative judgment, the other a residual component, as follows:

$(\theta_{ij} - \bar{\theta}) = (\theta^*_{ij} - \bar{\theta}) + (\theta_{ij} - \theta^*_{ij})$
(total = linear + residual)

Correspondingly we have the three sums of squares.

Total      $S_T = \sum_{i<j} (\theta_{ij} - \bar{\theta})^2$

Linear     $S_L = \sum_{i<j} (\theta^*_{ij} - \bar{\theta})^2$

Residual   $S_D = \sum_{i<j} (\theta_{ij} - \theta^*_{ij})^2$

It may be noted that s, d, and b all affect the linear component
(and also the total), while the residual is not affected by s, but
only by d and b. This separation can now be used as the basis for
an analysis of variance.

Because of the nature of the fitting process, and because of the
slightly non-linear relation between angles and normal deviates, the
deviations of the observations from their means have been separated
into two parts which are not formally "orthogonal." There is no
necessity for

$\sum_{i<j} (\theta^*_{ij} - \bar{\theta})(\theta_{ij} - \theta^*_{ij})$

to vanish. Consequently the two expressions for the sum of squares
associated with the fit according to the law of comparative judgment,

$\sum_{i<j} (\theta^*_{ij} - \bar{\theta})^2 = S_L$

and

$\sum_{i<j} (\theta_{ij} - \bar{\theta})^2 - \sum_{i<j} (\theta_{ij} - \theta^*_{ij})^2 = S_T - S_D$,

need not be precisely the same. So long as these give substantially
the same answer, we may use either $S_L$ or $S_T - S_D$ in assessing a
"reliability" without serious error. (Should they differ widely,
reconsideration of the fitting would be in order.)
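The three sums of squares and the cross-product term can be sketched as below. The example uses only three of the 36 Chicago cells (the pairs among 50a, 50b, and 50c, with observed and fitted proportions read from Tables 1 and 3), so its totals are illustrative and are not those of the full analysis. The function name and the dictionary layout are assumptions made for the sketch.

```python
import math

def theta(p):
    """Angular transform in radians."""
    return 2.0 * math.asin(math.sqrt(p))

def sums_of_squares(observed, fitted):
    """Return (S_T, S_L, S_D, cross) over the unordered pairs supplied.

    observed, fitted: dicts mapping a pair (i, j) to a proportion.
    """
    mean_angle = math.pi / 2.0
    s_total = s_linear = s_resid = cross = 0.0
    for pair, p in observed.items():
        t, t_star = theta(p), theta(fitted[pair])
        s_total += (t - mean_angle) ** 2
        s_linear += (t_star - mean_angle) ** 2
        s_resid += (t - t_star) ** 2
        cross += (t_star - mean_angle) * (t - t_star)
    return s_total, s_linear, s_resid, cross

# Three Chicago cells only, for illustration.
observed = {("50a", "50b"): 0.52, ("50a", "50c"): 0.67, ("50b", "50c"): 0.60}
fitted = {("50a", "50b"): 0.583, ("50a", "50c"): 0.744, ("50b", "50c"): 0.673}
s_total, s_linear, s_resid, cross = sums_of_squares(observed, fitted)
```

The identity S_T = S_L + S_D + 2(cross) always holds, so S_L and S_T - S_D differ by exactly twice the cross-product term; they coincide only when that term vanishes.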

The linear, residual, and total mean squares, together with the
number of judges ( N ) and the number of stimuli ( k ), may be used
to give estimates of the variances as follows:

total mean square      $T = \frac{2 S_T}{k(k - 1)} = \mathrm{est}\,(2\sigma_s^2 + \sigma_d^2 + \sigma_b^2)$

residual mean square   $D = \frac{2 S_D}{(k - 1)(k - 2)} = \mathrm{est}\,(\sigma_d^2 + \sigma_b^2)$

binomial mean square   $\frac{1}{N} = \mathrm{est}\,\sigma_b^2$

linear mean square     $L = \frac{S_L}{k - 1} = \mathrm{est}\,(k\sigma_s^2 + \sigma_d^2 + \sigma_b^2)$

It should be noted, as pointed out above, that we also have another
possible value for the linear mean square given by

$\frac{k}{2}(T - D) + D = \frac{S_T - S_D}{k - 1} = \mathrm{est}\,(k\sigma_s^2 + \sigma_d^2 + \sigma_b^2)$

We may also define an associated set of chi-squares as follows:

$\chi_L^2 = N S_L$

$\chi_{L'}^2 = N(S_T - S_D)$, $\chi_D^2 = N S_D$.

The  basic  formulas  for  the  associated  analyses  are  summarized  in 
Table  4. 

Starting with the observed values ( p ) and fitted values ( p* )
the values of $\theta$ and $\theta^*$ are found. These are used to compute $S_T$,
$S_D$, and $S_L$, the sums of squares. From these we get the mean square
values designated T, D, and L. These are used to give the
estimates of variance components and "reliabilities."
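The passage from sums of squares to mean squares and variance components can be sketched as follows, using the Chicago sums of squares reported in Table 5 (26.7717, 25.5606, and .7449, with k = 9 and N = 100). The function name and the dictionary keys are illustrative only.

```python
def variance_components(s_total, s_linear, s_resid, k, n_judges):
    """Mean squares and estimated variance components from the three sums of squares."""
    t = 2.0 * s_total / (k * (k - 1))          # total mean square
    d = 2.0 * s_resid / ((k - 1) * (k - 2))    # residual mean square
    l = s_linear / (k - 1)                     # linear mean square
    var_b = 1.0 / n_judges                     # binomial variation
    return {
        "T": t,
        "D": d,
        "L": l,
        "var_s": (l - d) / k,          # linear scale values, from L
        "var_s_alt": (t - d) / 2.0,    # linear scale values, from T
        "var_d": d - var_b,            # deviations from scalability
        "var_b": var_b,
    }

chicago = variance_components(26.7717, 25.5606, 0.7449, k=9, n_judges=100)
```

To four decimals these reproduce the Chicago entries of Table 5: T = .7437, L = 3.1951, D = .0266, and components .3521 (or .3585), .0166, and .0100.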

The application of the procedure indicated in Table 4 to the data
of Table 1 gives the results indicated in Table 5. In Table 5 the
values obtained for the Chicago group are indicated by (C), the values
for the Princeton group by (P), and the values obtained by pooling the
numbers of judgments for the two groups are indicated by (T). The
data on baseball teams presented by Mosteller (13) are indicated by (B).

TABLE 4

Outline of Analysis of Variance

The results show consistency in the variance components. Three
estimates of the linear component are available in the handwriting
experiment, 0.3521 (Chicago), 0.2868 (Princeton), and 0.3115 (combined).
Three estimates are similarly available of a "deviations from scalability"
component, 0.0166 (Chicago), 0.0171 (Princeton), and 0.0176 (combined).
In comparison with the linear component the deviations components are
small and agree unusually well among themselves. This fact suggests
that we have systematic and consistent, though small, deviations from
the law of comparative judgment.

Variance  ratios. 

In dealing with psychological tests many different sets of variance
ratios have been used, giving various types of validity and reliability
coefficients each having somewhat different properties and serving
somewhat different purposes. In general these coefficients are the
ratio of a measure of "true variance" to a measure of "observed variance"
which includes both "true variance" and "error variance." One reasonable
interpretation for paired comparisons is to regard the linear component
( $2\sigma_s^2$ ) as "true variance" and the other two components ( $\sigma_d^2 + \sigma_b^2$ ) as
error variance, so that we may define a coefficient of "linear
consistency" through

$r_s = \frac{2(L - D)}{kT} = \mathrm{est}\,\frac{2\sigma_s^2}{2\sigma_s^2 + \sigma_d^2 + \sigma_b^2}$
TABLE 5

Comparison of Scaling Data

Analysis of Variance

Source of            Degrees of   Sum of     Mean square             Angular
variation            freedom      squares                            chi-square   p
                     (df)

All                               $S_T$      $T = 2S_T / k(k - 1)$
                     36           26.7717     .7437            (C)   2677.17      (<.00001)
                     36           21.7404     .6039            (P)   2174.04      (<.00001)
                     36           23.6709     .6575            (T)   4734.18      (<.00001)
                     28            3.3468     .1195            (B)     73.63      (<.0001)

Linear scale                      $S_L$      $L = S_L / (k - 1)$
                      8           25.5606    3.1951 (3.2534)   (C)   2556.06      (<.00001)
                      8           20.8661    2.6083 (2.6229)   (P)   2086.61      (<.00001)
                      8           22.6075    2.8259 (2.8796)   (T)   4521.50      (<.00001)
                      7            2.6813     .3830 (0.3822)   (B)     58.99      (<.0001)

Residual                          $S_D$      $D = 2S_D / (k - 1)(k - 2)$
                     28             .7449     .0266            (C)     74.49      (<.0001)
                     28             .7575     .0271            (P)     75.75      (<.0001)
                     28             .6338     .0226            (T)    126.76      (<.0001)
                     21             .6717     .0320            (B)     14.78      (.80+)

Estimated Variance Components

        Linear scale values,      Deviations from     Binomial
        $\sigma_s^2$              scalability,        variation,
                                  $\sigma_d^2$        $\sigma_b^2$
        (L - D)/k    (T - D)/2    D - 1/N             1/N

(C)     .3521        .3585         .0166              .0100
(P)     .2868        .2884         .0171              .0100
(T)     .3115        .3174         .0176              .0050
(B)     .0439        .0437        -.0135              .0455

Estimated Reliabilities

        $r_s$           $r_{ss}$     $r^2_{\theta\theta^*}$     $r_b$
        2(L - D)/kT     1 - D/T                                 1 - 1/NT

(C)     .9468           .9642        .9723                      .9866
(P)     .9498           .9551        .9652                      .9834
(T)     .9475           .9656        .9733                      .9924
        ($r_2$ = .956)                                          ($r_1$ = .989)
(B)     .7343           .7322        .7993                      .6192

                                        k      N
(C)     Chicago data                    9     100
(P)     Princeton data                  9     100
(T)     These two together              9     200
(B)     Baseball data from              8      22
        Mosteller (13)
The factor 2 arises from the fact that $\sigma_s^2$ was normalized in
terms of individual stimuli, while $\sigma_d^2$ and $\sigma_b^2$ are normalized in
terms of differences. That is, $\sigma_s^2$ is the variance of the k different
s-values, while the variance of the k(k - 1) values ( $s_i - s_j$ )
is $2\sigma_s^2$, and the observed variance for the cell entries is
( $2\sigma_s^2 + \sigma_d^2 + \sigma_b^2$ ).

If the linear sum of squares is taken as $S_T - S_D$ (instead of
$S_L$), then we have another estimate for the coefficient of linear
consistency,

$r_{ss} = \frac{T - D}{T} = \mathrm{est}\,\frac{2\sigma_s^2}{2\sigma_s^2 + \sigma_d^2 + \sigma_b^2}$


These coefficients $r_s$ and $r_{ss}$ indicate the extent to which the
linear model (as represented by the fitted values $\theta^*$) fits the observed
cell entries, given by $\theta$. For example, if the agreement is perfect,
then $S_D$ and D will equal zero, and $S_T$ will equal $S_L$, which means that
2L/k = T so that $r_s = r_{ss} = 1.00$. If, on the other hand, the mean
squares T, L, and D are all equal, then $r_s = r_{ss} = 0.00$. These
coefficients $r_s$ and $r_{ss}$ are regarded as similar to $r^2_{\theta\theta^*}$, the square
of the correlation between observed and true values assuming the linear
model. Alternatively, $r_s$ and $r_{ss}$ may be regarded as representing
the correlation between two sets of observed values provided their
correlation is entirely accounted for by the true values (assuming a
linear model). The coefficients $r_s$ or $r_{ss}$ may be regarded as
appropriate to the recomparison of a randomly selected pair of the nine
handwriting specimens against a background of seven other specimens
covering the same range of merit and hence drawn from a population
having the same $\sigma_s^2$ as the specimens used in this experiment. For
example, if another set of three specimens each of values 50, 70, and
80 were scaled, a similar $\sigma_s^2$ would be expected; if $\sigma_d^2$ and $\sigma_b^2$ also
remained about the same, a similar degree of agreement between fitted
and observed values, i.e., a similar coefficient of linear consistency,
would be expected.

However, if all the handwriting specimens (from 10 to 90) in the
Ayres Scale were used, one would expect a larger $\sigma_s^2$, and if, as seems
plausible, $\sigma_d^2$ remained about the same, the result would be a higher
coefficient than that found here using only values 50, 70, and 80. On
the other hand, if one used only specimens 50, 60, and 70, a slightly
smaller $\sigma_s^2$ and (if $\sigma_d^2$ remained about the same) a slightly lower
reliability would be expected.
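The dependence of the coefficient on the spread of the stimuli can be sketched numerically. The deviation and binomial components below are set near the Chicago estimates (.0166 and .0100); the three values of the scale-value variance are hypothetical, chosen only to show the direction of the effect, and the function name is illustrative.

```python
def linear_consistency(var_s, var_d, var_b):
    """Ratio of true variance 2*var_s to total variance 2*var_s + var_d + var_b."""
    true_variance = 2.0 * var_s
    return true_variance / (true_variance + var_d + var_b)

# Hypothetical narrower, similar, and wider ranges of stimuli,
# with the other two components held fixed.
coefficients = [linear_consistency(v, 0.0166, 0.0100) for v in (0.15, 0.35, 0.70)]
```

With the error components held constant, the coefficient rises as the scale-value variance increases and falls as it decreases.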

It can be seen that even though Mosteller's chi-square goodness
of fit test ( $\chi_D^2$ ) shows clearly that the handwriting data deviate
significantly from a linear scale, nevertheless the scales show a
satisfactory agreement with the linear model, about .95 for the case
where the nine handwriting specimens were rated by 100 or 200 judges.
Since only $2\sigma_s^2$ is considered to be true variance, the coefficients
given by $r_s$ and $r_{ss}$ will be what are usually termed "conservative"
estimates. A "dashing" estimate for reliability is obtained by
regarding $\sigma_d^2$ as part of the true variance rather than as part of
the error variance. Thus we have

$\mathrm{est}\,\frac{2\sigma_s^2 + \sigma_d^2}{2\sigma_s^2 + \sigma_d^2 + \sigma_b^2} = 1 - \frac{1}{NT} = r_b$


This definition yields for the handwriting data reliabilities of .98 or
.99. This coefficient represents the correlation between two sets of
$\theta$-values for the same stimuli judged by another random sample of
people from the same population. Coefficients computed from this
formula are appropriate to the recomparison of a randomly selected pair
of the nine specimens against a background of seven other handwriting
specimens drawn from a population having the same $\sigma_s^2$ and also the
same peculiarities that produced the deviations from linearity. One
possibility is a recomparison of a random pair against a background
of the same seven other handwriting specimens. Thus we see that without
any assumptions about the law of comparative judgment one has a set
of stimuli that cannot be regarded as indifferent to the subjects.

A corresponding chi-square is given by

$\chi_T^2 = N S_T$

with degrees of freedom
df = (k/2)(k - 1)

These values of chi-square ( $\chi_T^2$ in Table 5) are all extremely large,
indicating a negligible probability that the data could have arisen by
random sampling from a population in which the proportions were all .5.
The coefficient ( $r_b$ ), which is zero if the percentages of Table 1
are all random binomial deviations from .5, may be compared with
Kendall's coefficient of agreement (10, pp. 125ff.; 11, pp. 333ff.),
which is unity only if all proportions are 1.0 or 0.0; i.e., if there
is complete agreement among all judges in making each judgment. Kendall's
coefficient of agreement is determined directly from the experimental
frequencies, without using any transforms such as the arc sin. The data
here presented cannot be regarded as showing such agreement among all
judges. However, they clearly cannot be regarded as indicating only
random judgments.
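The three single-sample coefficients discussed above can be computed directly from the mean squares. The sketch below uses the Chicago mean squares from Table 5 (T = .7437, L = 3.1951, D = .0266, with k = 9 and N = 100); the function name is illustrative only.

```python
def reliabilities(t, l, d, k, n_judges):
    """Return (r_s, r_ss, r_b) from the total, linear, and residual mean squares."""
    r_s = 2.0 * (l - d) / (k * t)        # linear consistency, from L
    r_ss = 1.0 - d / t                   # linear consistency, from T - D
    r_b = 1.0 - 1.0 / (n_judges * t)     # deviations counted as true variance
    return r_s, r_ss, r_b

r_s, r_ss, r_b = reliabilities(t=0.7437, l=3.1951, d=0.0266, k=9, n_judges=100)
```

These agree with the Chicago row of estimated reliabilities in Table 5 (.9468, .9642, and .9866) to within rounding of the mean squares.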

We may compare these coefficients computable for a single set of
data with more conventional reliabilities obtained by comparing the
Princeton with the Chicago scale values. The correlation between the
two sets of values in Table 2 ( $r_1$ ) is .989, which, it may be noted,
is similar in magnitude to $r_b$. If we make no allowance for changes
in discriminal dispersion, but take the entire difference of scale
values (adjusted to a common mean but not to a common variance) as
error, then

$r_2 = 1 - \frac{[\text{numerator illegible}]}{\sum x^2 + \sum y^2} = .956$

which is similar in magnitude to the estimates of $r_s$.
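The correlation between the Chicago and Princeton scale values can be checked with a short sketch. The values are those of Table 2 in its column order; two entries (Chicago 70b taken as 1.738 and Princeton 50c taken as 0.384) are read here on the assumption that they must agree with the fitted proportions of Table 3. The function name is illustrative only.

```python
import math

# Scale values from Table 2: 50a, 50b, 50c, 70a, 80b, 70b, 80c, 70c, 80a.
# Assumption: Chicago 70b = 1.738 and Princeton 50c = 0.384 (see lead-in).
chicago = [0.000, 0.210, 0.657, 1.179, 1.799, 1.738, 2.054, 2.169, 2.472]
princeton = [0.000, 0.107, 0.384, 0.808, 1.252, 1.578, 1.690, 1.794, 2.048]

def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r_1 = pearson_r(chicago, princeton)
```

Under the stated reading of the two entries, the computed correlation rounds to the .989 reported in the text.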

Two coefficients have been suggested. The coefficient $r_b$ indicates
the extent to which the stimuli are differentiated by the subjects.
It seems reasonable to regard $r_s$ or $r_{ss}$ as a conservative
estimate of consistency for a single set of data scaled by the law of
comparative judgment. In such a case there would be no replication to
indicate that $\sigma_d^2$ might, from some points of view, reasonably be
regarded as part of the true variance. The estimates $r_s$ and $r_{ss}$ give
a direct measure of the agreement between the observed ( $\theta$ ) and
fitted ( $\theta^*$ ) values of the arc sin $\sqrt{p}$.

The lines labelled "(B)" in Table 5 give for comparison the data
on baseball teams reported by Mosteller (13). It is interesting to
note that despite the non-significant chi-square, the reliability
( $r_s$ or $r_{ss}$ ) is only .73, while $r_b = .62$. This low reliability
is due apparently to the similarity of the different teams, since
est $\sigma_s^2$ is only .0439, which is less than the binomial variation of
.0455 with which it must be combined. Under these circumstances it
is not surprising that chi-square is not significant, especially with
N as low as 22. On the other hand, the data on handwriting have a
smaller binomial variance (.01), and a much larger $\sigma_s^2$ (about .3).
Despite the fact that the residual mean square ( D ) is slightly smaller
than that for the baseball data, when N equals 100 or 200 with 28
degrees of freedom, this much smaller discrepancy cannot be regarded
as due to chance.

In summary, a variance-components analysis has been presented for
paired comparisons. This analysis gives estimates of the variance of
the actual scale values ( $\sigma_s^2$ ), and the variance of observations due
to deviations of the data from the linear paired comparisons model ( $\sigma_d^2$ ),
which are compared with the binomial sampling variance ( $\sigma_b^2$ ). A variety
of coefficients based on these three variances are also presented. If
one is interested in asking whether or not the subjects' responses are
purely random, then Kendall's coefficient of agreement, or the $r_b$ as
presented here, may be used. If one is interested in the extent to which
the law of comparative judgment accounts for the data, then $r_s$ or $r_{ss}$
would be the appropriate coefficient.